Let's write a cleaning function

You can use Python string functions to do some basic data cleaning. (For data sets with more complex cleaning needs, you might want to use a power tool like Open Refine.)

Here, we're going to write a function that takes in one row of data as input, cleans up the pieces of data in the row, then returns a cleaned version of that row. As we loop over the data in the file, we'll call this function on each row, then write out the clean row to a new file.

Let's break down our tasks:

  • Write a cleaning function that accepts a row of raw data and returns a row of clean data
  • Open our CSV file of raw data
  • Open a CSV file to write the cleaned data into
  • Loop over the rows of raw data, passing each row to our cleaning function
  • Write out the clean data to the new file

The data

We're going to be working with the FDIC's list of failed banks.

Write the cleaning function

First, we need to write our cleaning function -- let's call our function clean_row(). We need to decide whether the row it parses will be a dictionary (using csv.DictReader) or a list (using csv.reader).

Let's use a dictionary.

Here are the fields that we are going to include in our output file. The ones that need cleaning are in bold.

  • Bank Name: Sometimes has extra whitespace, needs to be uppercase, our house style dictates that ampersands should be replaced by the word "and"
  • City: Needs to be uppercase
  • ST
  • Acquiring Institution: Sometimes has extra whitespace, needs to be uppercase, our house style dictates that ampersands should be replaced by the word "and"
  • Closing Date

In [ ]:
# first line defines the function and the argument
# ("row" is an arbitrary variable name)
def clean_row(row):
    
    """
    For the bank and institution name:
        - strip whitespace
        - uppercase the name
        - replace '&' with 'AND'
        
    n.b.: you can chain string methods together
    """
    clean_bank = row['Bank Name'].strip().upper().replace('&', 'AND')
    clean_inst = row['Acquiring Institution'].strip().upper().replace('&', 'AND')

    # strip whitespace and upcase the city
    clean_city = row['City'].strip().upper()
    
    # return a dictionary of clean data
    # the keys ~must~ match the headers of our output file
    return {
        'bank': clean_bank,
        'inst': clean_inst,
        'city': clean_city,
        'st': row['ST'],
        'c_date': row['Closing Date']
    }

Use the cleaning function

Now, in a with block, we'll do the following:

  • Read in data/failed_banks.csv
  • Open banks-clean.csv to write to
  • Loop over the rows of raw data
  • Call the cleaning function on each row
  • Write the returned (clean) data to banks-clean.csv

In [ ]:
# import the csv library
import csv

# open the two files
with open('data/failed-banks.csv', 'r') as infile, open('banks-clean.csv', 'w') as outfile:
    
    # create a DictReader object
    reader = csv.DictReader(infile)
    
    # create a DictWriter object
    # the fieldnames must exactly match the keys in the dictionary being returned
    # by our cleaning function
    writer = csv.DictWriter(outfile, fieldnames=['bank', 'inst', 'city', 'st', 'c_date'])
    
    # write out header row
    writer.writeheader()
    
    # loop over the rows of raw data
    # "row" is an arbitrary variable name
    for row in reader:
        
        # call the cleaning function on the row
        cleaned = clean_row(row)
        
        # write out the clean row
        writer.writerow(cleaned)

Extra credit

The Closing Date field in the bank failure data is in this format: 6-Sep-2011. In other words, day, then abbreviated month as text, then year.

Python's built-in datetime module has two methods that can help us reformat them: strftime() and strptime().

Your task: Add some code to the cleaning function to reformat the closing date in yyyy-mm-dd format. This will require doing some research into a module that we haven't discussed yet. (Good practice for when you're coding on your own.)

Breaking it down into smaller tasks:

  • You'll need to import datetime from the datetime module: from datetime import datetime
  • Then figure out how to use strptime() to turn a 6-Sep-2011-type string into a Python date object
  • Then figure out how to format that date object as yyyy-mm-dd using strftime()
  • Then add that functionality to the cleaning function and re-run the bank data

Google is your friend here. Try searching for things like "python strptime example." (Freebie: Here's a handy guide to the date directives.) Noodle around in a cell. Get something working for one date -- a test string -- before setting your solution loose on the whole file. Try new things, see what happens, fail, find solutions. It's all part of the learning process.


In [ ]:
import csv
from datetime import datetime

def clean_row(row):

    clean_bank = row['Bank Name'].strip().upper().replace('&', 'AND')
    clean_inst = row['Acquiring Institution'].strip().upper().replace('&', 'AND')
    clean_city = row['City'].strip().upper()
    
    # reformat the date
    clean_date = datetime.strptime(row['Closing Date'], '%d-%b-%y').strftime('%Y-%m-%d')

    return {
        'bank': clean_bank,
        'inst': clean_inst,
        'city': clean_city,
        'st': row['ST'],
        'c_date': clean_date
    }

with open('data/failed_banks.csv', 'r') as infile, open('banks-clean.csv', 'w') as outfile:
    
    reader = csv.DictReader(infile)
    writer = csv.DictWriter(outfile, fieldnames=['bank', 'inst', 'city', 'st', 'c_date'])
    
    writer.writeheader()
    
    for row in reader:
        cleaned = clean_row(row)
        writer.writerow(cleaned)

In [ ]: